BYOL for Audio: Exploring Pre-Trained General-Purpose Audio Representations

Authors

Abstract

Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. For recognizing sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. For serving the diverse needs of tasks such as recognition of emotions or music genres, representations should provide multiple aspects of information, such as local and global features. To implement our principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced “viola”). BYOL-A pre-trains representations of the input sound that are invariant to audio data augmentations, which makes the learned representations robust to the perturbations of sounds. Meanwhile, the BYOL-A encoder combines local and global features and calculates their statistics to make the representation provide multi-aspect information. As a result, the learned representations should provide robust and multi-aspect information to serve various needs of diverse tasks. We evaluated the general audio task performance of BYOL-A compared to previous state-of-the-art methods, and BYOL-A demonstrated generalizability with the best average result of 72.4% and the best VoxCeleb1 result of 57.6%. Extensive ablation experiments revealed that the BYOL-A encoder architecture contributes most of the performance, and the final critical portion resorts to the BYOL framework and BYOL-A augmentations. Our code is available online for future studies.
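The two ingredients named in the abstract, the BYOL objective and the pooling of features into statistics, can be illustrated with a minimal sketch. This is not the authors' implementation. The loss and the moving-average update follow the general BYOL framework. The choice of mean and max as the temporal statistics is an assumption, since the abstract does not say which statistics are used. All function names are hypothetical.

```python
import math


def byol_loss(prediction, target):
    """BYOL-style loss: MSE between L2-normalized vectors (2 - 2 * cosine)."""
    p_norm = math.sqrt(sum(x * x for x in prediction))
    z_norm = math.sqrt(sum(x * x for x in target))
    cosine = sum(p * z for p, z in zip(prediction, target)) / (p_norm * z_norm)
    return 2.0 - 2.0 * cosine


def ema_update(target_weights, online_weights, tau=0.99):
    """Move the target network's weights slowly toward the online network's."""
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_weights, online_weights)]


def temporal_stats(frames):
    """Pool a (time x feature) list of frames into [means..., maxima...].

    Assumption: mean and max over time stand in for the "statistics"
    mentioned in the abstract.
    """
    columns = list(zip(*frames))  # one tuple per feature dimension
    means = [sum(c) / len(c) for c in columns]
    maxima = [max(c) for c in columns]
    return means + maxima
```

In this sketch, two augmented views of the same clip would each be encoded and pooled with `temporal_stats`. The online branch's prediction is then compared with the target branch's projection through `byol_loss`, and `ema_update` is applied to the target weights after each training step.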


Similar resources

Exploring Features for Audio


Tahakum: A Multi-Purpose Audio Control Framework

We present “Tahakum”, an open source, extensible collection of software tools designed to enhance workflow on multichannel audio systems within complex multi-functional research and development environments. Tahakum aims to provide critical functionality required across a broad spectrum of audio systems usage scenarios, while at the same time remaining sufficiently open as to easily support mod...


Feature Representations for Neuromorphic Audio Spike Streams

Event-driven neuromorphic spiking sensors such as the silicon retina and the silicon cochlea encode the external sensory stimuli as asynchronous streams of spikes across different channels or pixels. Combining state-of-art deep neural networks with the asynchronous outputs of these sensors has produced encouraging results on some datasets but remains challenging. While the lack of effective spi...


Representations of ambient audio for the Deaf


Watermarking parametric representations for synthetic audio

This paper proposes to watermark parametric representations for synthetic audio. Our watermark system combines quantization index modulation at the encoder and maximum likelihood parameter estimation at the decoder. To guarantee error-free data hiding under expected types of attacks, knowledge of Fisher information and Cramér-Rao bounds is applied to the system design. Experiments show that, me...



Journal

Journal title: IEEE/ACM Transactions on Audio, Speech, and Language Processing

Year: 2023

ISSN: 2329-9304, 2329-9290

DOI: https://doi.org/10.1109/taslp.2022.3221007